ABSTRACT

Techniques for maximizing discrimination between
groups in norm-referenced measurement, to reflect the sensitivity to group
differences inherent in the evaluation of multilevel educational
systems, are discussed. Data from the Beginning Teacher Evaluation
Study (BTES) were used to examine relationships between reading and
mathematics achievement and instructional variables. Intraclass
correlation, correlations of class means on items and on total scale,
and correlations of class means on items and on allocated time in
instruction were used as item selection techniques to form subscales
of the fraction test of the BTES. By constructing the subscales on
the basis of the between-class relationship of the items to
instruction, the sensitivity of the scale to between-class
differences in instruction was found to increase and the sensitivity
to the same two variables within class decreased. These results
indicate the usefulness of selecting items on the basis of their
relationship to the variable of interest. (Author/CM)

While tests are used to assess achievement differences between
individuals, as well as to rank achievement differences among
aggregates of individuals, such as classrooms, schools, or programs,
the psychometric model used in the construction of norm-referenced
tests has focused primarily on the former. The use of the individual
as the unit of analysis in test construction, combined with the largely
negative results of school effects studies and large-scale evaluations
about the relationship of school inputs to pupil outcomes (Averch
et al., 1972; Coleman et al., 1966; Jencks et al., 1972; Stebbins
et al., 1977), has caused many educational researchers to reexamine the
statistical techniques and models used to arrive at these conclusions.

A concern over the possible mismatch between the methods used to
construct norm-referenced tests and the kinds of issues being addressed
has led to questions about the program relevance and instructional
sensitivity of norm-referenced measurement (Airasian & Madaus, 1976;
Berliner, 1978; Carver, 1974; Hanson & Schutz, 1978; Leinhardt
& Seewald, 1981; Madaus et al., 1979, 1980; Porter et al., 1978).

This concern over the sensitivity of tests to instructional and program
effects is evident in recent investigations of the overlap between
test content and instructional content. These studies indicate that
test performance is higher when there is substantial overlap between
test content and instructional content (Armbruster et al., 1977;
Jenkins & Pany, 1976; Leinhardt & Seewald, 1981; Madaus et al., 1979;
Walker & Schaffarzik, 1974). This evidence, in conjunction with the
finding that there is wide variation in content coverage in the major
standardized achievement tests (Porter et al., 1978), raises the question
of whether schools are skilled at or successful at selecting the test
that best fits their curriculum, or whether this is even possible.
Moreover, as long as teachers have the freedom to choose which topics
to cover and emphasize within a subject area, tests may not be useful
or relevant for measuring between-class differences.

Another concern about norm-referenced measurement has centered
on the empirical methods used to construct tests. Some critics have
argued that tests designed to differentiate among individuals can
maximize the within-school differences relative to the between-school
or between-program differences (Carver, 1974; Lewy, 1973). Theoretically,
of course, there is no reason to assume that a test designed to measure
individual differences cannot also measure school or program differences.
However, the bulk of the evidence from school effectiveness studies
suggests either that school or program differences are small or do not
exist after controlling for home background and entering ability, or
that the between-group differences are not being measured properly
(Madaus et al., 1980).

One approach to improving the sensitivity of measures of group
differences might be to consider the inherent multilevel character
of the educational system. That is, students are nested within
classrooms, classrooms are nested within schools, and so on. Analyses can be
conducted both between and within each of the levels of the educational
system, and analyses within and between the different levels can have
different substantive meanings (Burstein, 1978, 1980; Burstein, Fischer,
& Miller, 1980; Cronbach, 1976). Thus, if analyses are not conducted
from a multilevel perspective, one can fail to clearly identify important
effects occurring at different levels. Because of a concern for the
analysis of data at multiple levels, many major evaluations, such as
Project Follow Through (Haney, 1974) and the National Day Care Study
(Singer & Goodrich, 1979), have devoted considerable time and expense
to the selection of the unit of analysis. Since education does affect
student outcomes between and within all levels of the educational system,
it has been argued that evaluations of educational data should look at
more than one level of analysis for a more complete understanding of
the determinants of student achievement. In fact, Cronbach (1976, p. 1)
argued that the "majority of studies of educational effects - whether
classroom experiments, or evaluations of programs or surveys - have
collected and analyzed data in ways that conceal more than they reveal.
The established methods have generated false conclusions in many
studies."

While there has been a rapid rise in the concern for multilevel
issues in large-scale evaluations and school effects studies (see
Burstein, 1980, for a review), most researchers have ignored the issue
of multilevel data analysis in the construction of tests and the
analysis of item data, with a few notable exceptions. In his monograph
on multilevel issues, Cronbach (1976, pp. 9.19-9.20) discussed the
possible utility of multilevel item analysis:

Once the question of units is raised, all empirical test construction
and item-analysis procedures need to be reconsidered. Is it
better to retain items that correlate across classes? Or items
that correlate within classes? A correlation based on deviation
scores within classes indicates whether students who comprehend
one point better than most students also comprehended the second
point better than most - instruction being held constant. A
correlation between classes indicates whether a class that learned
one thing learned another, but this depends first and foremost
on what teachers assigned and emphasized. It is the items teachers
give different weight to that have the greatest variance across
classes. This (differential emphasis) leads us to regard the
between-group and within-group correlations of items as conveying
different information, and makes the overall correlation for
classes pooled an uninterpretable blend.

As Cronbach (1976) suggests, it may be useful to reexamine the
empirical techniques used in item analysis and test construction in a
multilevel context. Hence, instead of using indices of item
discrimination between subjects in test construction, indices of item
discrimination between groups may prove more useful in building scales
more sensitive to differences between groups. One test construction
technique for building tests more sensitive to between-group differences
was suggested by Lewy (1973). Since the purpose of the test is to
discriminate between groups, Lewy suggested an index of how well items
discriminate between groups as a criterion for inclusion in the test -
the intraclass correlation. The intraclass correlation is equal to the
proportion of variation in an item that is attributable to group differences.
Thus, the intraclass correlation coefficient equals one when all scores
within each group are identical and the only variance is due to differences
between groups. Conversely, the intraclass correlation coefficient equals
zero when all the group means are equal and the only variation is due
to differences within a group. Lewy proposed that the intraclass coefficient
be used to identify subsets of items that maximize the variance between
groups on the total test relative to the total test score variance.
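
The definition above can be illustrated with a minimal sketch. It assumes the index is computed as the ratio of the between-group sum of squares to the total sum of squares (the variance-ratio form), which is one reading of the description and not necessarily the estimator used in the study. The function name and the toy data are hypothetical.

```python
import numpy as np

def intraclass_eta_squared(scores, groups):
    """Share of the total variation in `scores` that lies between groups.

    Assumption: between-group sum of squares divided by total sum of squares.
    """
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    grand_mean = scores.mean()
    ss_total = ((scores - grand_mean) ** 2).sum()
    if ss_total == 0:
        return float("nan")  # no variation at all, so the ratio is undefined
    ss_between = 0.0
    for g in np.unique(groups):
        members = scores[groups == g]
        ss_between += members.size * (members.mean() - grand_mean) ** 2
    return ss_between / ss_total

# Hypothetical data: two classes of three students each.
groups = np.array([0, 0, 0, 1, 1, 1])
all_between = intraclass_eta_squared([1, 1, 1, 0, 0, 0], groups)  # scores identical within class
all_within = intraclass_eta_squared([1, 0, 1, 0, 1, 1], groups)   # class means are equal
print(all_between, all_within)
```

The two toy items reproduce the limiting cases described in the text: the coefficient is one when scores are identical within each class and zero when the class means coincide.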

While the intraclass correlation coefficient may be a useful
index of how well an item differentiates between groups, using it as
the sole criterion for item selection may be overly simplistic. As
Airasian and Madaus (1976) point out, items may differentiate between
groups in different directions, so that they fail to discriminate between
groups when summed into a single composite. For example, given two
groups of equal size, if everyone in one group answered one item
correctly and the other item incorrectly, while the reverse was true
for the second group, then the two items would each have an intraclass
correlation of one, but the sum of the items would not discriminate
between groups at all. Because of this phenomenon, Airasian and
Madaus suggested using the between-group intercorrelations of the items
along with the intraclass correlations. It could even be argued that
the intercorrelations are a more important piece of information, since
the variance of an n-item scale is equal to the sum of n item variances
and n(n-1) item covariances. So the between-group item intercorrelations could
be used to develop a scale which maximized the variance between groups.
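
The two-item example can be checked numerically. The sketch below uses hypothetical data for two equal-sized groups and treats the between-group variance of the composite as the sum of all entries of the between-group covariance matrix of the item means (n variances plus n(n-1) covariances). Unweighted class means are an assumption that is harmless here because the groups are the same size.

```python
import numpy as np

groups = np.array([0, 0, 0, 1, 1, 1])
item_a = np.array([1, 1, 1, 0, 0, 0])  # group 0 all correct, group 1 all incorrect
item_b = np.array([0, 0, 0, 1, 1, 1])  # the reverse pattern

def class_means(scores, groups):
    """Mean score in each group, in sorted group order."""
    return np.array([scores[groups == g].mean() for g in np.unique(groups)])

means_a = class_means(item_a, groups)
means_b = class_means(item_b, groups)
means_sum = class_means(item_a + item_b, groups)

# Between-group covariance matrix of the item means (population form).
cov = np.cov(np.vstack([means_a, means_b]), ddof=0)
var_of_sum = cov.sum()  # 2 item variances + 2 item covariances
rho = np.corrcoef(means_a, means_b)[0, 1]

print(cov)
print(var_of_sum, means_sum.var(), rho)
```

Each item separates the groups perfectly, yet the between-group correlation of the two items is negative, the covariances cancel the variances, and the composite has no between-group variance.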



Using the item between-group intercorrelations will create a
scale which is internally consistent for discriminating groups. This
procedure can rapidly become unwieldy, however, since there are n(n-1)/2
intercorrelations between n items. Because of this, a procedure that
has been used to build internally consistent scales for measuring
individual differences might also be applied to build an internally
consistent scale for measuring differences between groups. That is,
rather than the point-biserial correlation between the items and the
total scale, the correlation between the weighted item group means
and the group means on the total scale could be used. Thus, the
information needed for any decisions is reduced from n(n-1)/2 to n.
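
A minimal sketch of this between-group analogue of the item-total correlation follows. It assumes that "weighted" means weighting each group by its size and that the total scale includes the item itself; neither detail is spelled out in the text. The function name and data are hypothetical.

```python
import numpy as np

def between_class_item_total(item_scores, groups):
    """Correlate class means on each item with class means on the total scale.

    Assumptions: classes are weighted by their size, and the total score
    includes the item being examined.
    """
    X = np.asarray(item_scores, dtype=float)   # students x items
    groups = np.asarray(groups)
    labels, counts = np.unique(groups, return_counts=True)
    item_means = np.vstack([X[groups == g].mean(axis=0) for g in labels])
    total_means = item_means.sum(axis=1)       # class mean of the total score
    w = counts / counts.sum()

    def weighted_corr(x, y):
        mx, my = np.sum(w * x), np.sum(w * y)
        cov = np.sum(w * (x - mx) * (y - my))
        return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

    return np.array([weighted_corr(item_means[:, j], total_means)
                     for j in range(X.shape[1])])

# Hypothetical data: three classes of two students, three items scored 0/1.
scores = np.array([[1, 1, 0], [1, 0, 0],
                   [0, 1, 1], [1, 1, 1],
                   [0, 0, 1], [0, 0, 0]])
classes = np.array([0, 0, 1, 1, 2, 2])
rho_item_total = between_class_item_total(scores, classes)
print(rho_item_total)
```

Only one coefficient per item is produced, which is the reduction from n(n-1)/2 to n quantities described above. An item whose class means do not vary would yield an undefined coefficient and would need separate handling.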

One final approach to item selection would be to use some criterion
external to the test. For example, the Beginning
Teacher Evaluation Study (BTES) had some success in developing scales
sensitive to instructional differences between individuals (BTES: Filby
& Dishaw, 1975, 1976). However, in the BTES study, all instructional
variables were measured at the student level (e.g., allocated time).
Because this is not always possible due to practical considerations
(e.g., the time and expense that would be needed in a larger study), as
well as the fact that many instructional variables cannot be measured
at the student level (e.g., number of aides or money invested), the
criteria used in item selection might be group-level measures (e.g.,
instructional materials) or even aggregate measures of individual-level
variables (e.g., opportunity to learn). Even when the individual-level
measures of the instructional variables (e.g., instructional time) are
available for the item tryout, the relationship of the items to the
aggregate measure might be used for item selection if the unit of
analysis is the aggregate in the final study.

Data Analysis

Sample. The Beginning Teacher Evaluation Study (BTES: Fischer
et al., 1978) was sponsored by the California Commission for Teacher
Preparation and Licensing with funds from the National Institute of
Education. The study was conducted to examine the relationship of
reading and mathematics achievement to instructional variables in
grades 2 and 5. Fractions was a subject area in which a great deal of
time and effort were expended in many fifth grade classrooms. Tests
were administered to six students in each of 25 second and 25 fifth
grade classes on four occasions: (A) October, 1976; (B) December,
1976; (C) May, 1977; and (D) September, 1977. Since there was very little
fraction instruction until after December, the October testing did not
include the fractions subtest. In addition to the achievement tests,
measures of allocated time, engagement rates, and success rates were
obtained. Also, teacher behavior measures were collected. To reduce
the variability due to initial ability and home background, students
who scored extremely low or extremely high on a
selection test at the beginning of the year were not selected. Selected students were
roughly between the 30th and 70th percentiles of the overall distribution
from all classes.

The fraction subtest data consisted of fifteen items administered
on three occasions. The skills tested included fraction addition,
fraction subtraction, reducing fractions, and finding the missing
numerator or denominator in a fractional equation. Data were obtained
from 127 students on occasion B (December, 1976), 123 students on
occasion C (May, 1977), and 89 students on occasion D (September, 1977).
The students were drawn from 21 classrooms.

In addition, the pilot data will be used for the test construction.
Because of an interest by the BTES in instructional variables, special
effort was made to develop instructionally sensitive measures (BTES:
Filby & Dishaw, 1975, 1976). Two criteria were used to enhance the
likelihood that the tests would be instructionally sensitive. First,
item content was checked to be sure that instructional content and
test content overlapped. Next, items were checked to see if gains in
achievement were related to gains in instruction (Carver, 1974).
This second criterion involved two assumptions. First, students would
perform better after instruction than before instruction. Second,
students who received more instruction would achieve higher than
students who received less instruction. Consequently, the pilot
study, conducted in April, 1975, included both test item data and
a measure of allocated time. The sample included 72 subjects drawn
from 6 classrooms.

Data Analysis. Three of the item selection techniques outlined
above will be used to form subscales of the fraction test. Items
will be selected on the basis of their characteristics in the spring
testing of the pilot study, and the corresponding scale will be examined
in the spring testing of the final BTES study. The three criteria
used in item selection are:

(1) the ability of the item to discriminate between groups by
    itself (i.e., intraclass correlation);

(2) how the item discriminates in relationship to the total
    scale (e.g., correlation of class means on items and class
    means on total scale); and

(3) whether the item discriminates between classes that vary
    in instruction (e.g., correlation of class means on items
    and class means on allocated time in fraction instruction).

The primary criterion used to judge the utility of these test
construction methods will be the intraclass correlation of the formed scale.
However, when the correlation of the mean allocated time and the
item means by classroom is used for item selection, the resulting
scale's relationship to instructional variables in the final study will
also be examined.
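
The third criterion and the pilot-then-final design can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: class means are correlated without weighting, items are kept when the correlation exceeds a cutoff, and the subscale is the sum of the retained items. All names, the cutoff, and the data are hypothetical.

```python
import numpy as np

def select_items_by_class_criterion(item_scores, groups, class_criterion, threshold):
    """Keep items whose class means correlate above `threshold` with a class-level criterion.

    `class_criterion` must be ordered like np.unique(groups), e.g. the mean
    allocated time in each class (an assumption of this sketch).
    """
    X = np.asarray(item_scores, dtype=float)   # students x items
    groups = np.asarray(groups)
    labels = np.unique(groups)
    item_means = np.vstack([X[groups == g].mean(axis=0) for g in labels])
    criterion = np.asarray(class_criterion, dtype=float)
    rho = np.array([np.corrcoef(item_means[:, j], criterion)[0, 1]
                    for j in range(X.shape[1])])
    return np.flatnonzero(rho > threshold), rho

# Hypothetical pilot data: three classes of two students, three items scored 0/1.
pilot_items = np.array([[0, 1, 1], [0, 1, 0],
                        [1, 1, 0], [0, 0, 1],
                        [1, 0, 1], [1, 0, 1]])
pilot_classes = np.array([0, 0, 1, 1, 2, 2])
class_time = [10.0, 20.0, 30.0]  # hypothetical mean allocated time per class

selected, rho = select_items_by_class_criterion(
    pilot_items, pilot_classes, class_time, threshold=0.05)

# Score the selected subscale on (hypothetical) final-study responses.
final_items = np.array([[1, 0, 1], [0, 1, 0]])
subscale = final_items[:, selected].sum(axis=1)
print(selected, rho, subscale)
```

Selecting on pilot data and scoring on separate final-study data keeps the selection step from capitalizing on chance in the sample where the scale is evaluated.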
Results and Discussion

Three properties of the items were used to form scales. The
first criterion was the proportion of variance in the item attributable
to the differences between classes - the intraclass correlation. The
second criterion was the relationship of the item to the total scale -
the correlation of the class means on the item with the class means
on the total scale. The third criterion was the relationship of the
item to another variable - the correlation of the class means on the
item with the class means on time allocated to fractions. The descriptive
statistics used in the item selection are contained in Tables 1, 2,
and 3.

The intraclass correlation for the fifteen-item scale in the final
study was .47. Forming scales from the item intraclass correlation did
not increase the ratio of between-class variance to total variance
between subjects. Selecting the ten items with an intraclass correlation
greater than or equal to .10 or the four items with intraclass
correlations greater than or equal to .15 led to an intraclass correlation on
the scale of .46 and .44, respectively. Similarly, selecting items
so that the between-class item-total scale correlation was greater than
or equal to .75 and .80 led to scales with an intraclass correlation of
.45 (9 items) and .42 (5 items), respectively. Finally, selecting
the four items with a positive between-class correlation of allocated
time and the item (ρ > .05) led to a scale with an intraclass correlation
of .42. Hence, selecting items on the basis of their statistical
properties does not seem to increase the proportion of variance in the
scale that is due to group differences.

However, selecting items on the basis of their relationship to
another variable did increase the sensitivity of the scale to the variable
of interest. In Table 4, the fifteen-item scale and the four-item
scale formed by the between-class correlation of mean allocated time
and the item means are predicted from the same set of variables - the
pretest, allocated time, engagement rate, hard time, and easy time. By
examining the standardized regression coefficients, it can be seen that
the greatest differences in the prediction of the two scales are in their
sensitivity to two variables which are similar to the criterion used in
item selection - allocated time and engagement rate. By constructing
the scale on the basis of the between-class relationship of the items
to instruction, the sensitivity of the scale to between-class differences
in instruction is increased and the sensitivity of the scale to the same
two variables within class is decreased. Thus, if the object is to
determine the relationship of achievement to differences between classes
in instruction, it may be useful to select the items for the achievement
test on the basis of their relationship to the variable of interest.

Table 1. BTES item intraclass correlations (η²) on
         the spring pilot testing.

Item    η²      Item    η²      Item    η²
  1    .14        6    .03       11    .11
  2    .11        7    .16       12    .11
  3    .08        8    .18       13    .24
  4    .11        9    .06       14    .05
  5    .04       10    .18       15    .10




Table 2. BTES between-class item-total (ρ) correlations
         on the spring pilot testing.

Item    ρ       Item    ρ       Item    ρ
  1    .91        6    .94       11    .79
  2    .88        7    .54       12    .45
  3    .97        8    .83       13    .56
  4    .77        9    .77       14    .46
  5    .66       10    .75       15    .45

Table 3. BTES between-class correlation (ρ) of the item
         and time allocated to fractions from the spring
         pilot study.

Item    ρ       Item    ρ       Item    ρ
  1    [a]        6   -.35       11    .03
  2    [a]        7    .16       12   -.05
  3    [a]        8   -.20       13   -.42
  4    .51        9    .19       14   -.51
  5    .45       10   -.64       15   -.49

[a] Only two values, -.53 and -.70, are legible for items 1-3.
Table 4. Prediction of the spring achievement test and a subscale from the pretest and
         instructional variables from the BTES final study. (a)

                              Total                      Items 4, 5, 7, & 9
                     Unstandardized  Standardized   Unstandardized  Standardized

Within Class
  Pretest                  .39           .34              .16           .43
                         (11.78)                      [illegible]
  Allocated Time       [illegible]   [illegible]         -.00          -.03
                       [illegible]                       (.06)
  Engagement Rate      [illegible]       .16             -.11       [illegible]
                         (1.94)                       [illegible]
  Hard Time              -11.76         -.16            -3.51          -.14
                       [illegible]                    [illegible]
  Easy Time               2.44           .11              .59           .08
                         (3.88)                          (.50)

Between Class
  Pretest                  .24           .14              .02           .03
                          (.45)                         (1.52)
  Allocated Time           .02           .24              .01           .38
                          (.67)                         (1.52)
  Engagement Rate         1.07           .05             1.87           .27
                          (.05)                         (1.15)
  Hard Time              -1.91          -.06              .68           .07
                          (.06)                          (.06)
  Easy Time              -8.74          -.06            -6.23          -.12
                          (.10)                          (.43)

R²                         .52                            .47

(a) F-tests in parentheses.